BFT: A Relational-based Bit Filtration Technique for Efficient Approximate String Joins in Biological Databases
Authors
Abstract
Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to mitigate the curse of dimensionality. This paper proposes a novel approach that maps the problem of pairwise whole-genome comparison onto an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and we further apply the DFT (Discrete Fourier Transform) and DWT (Discrete Wavelet Transform, Haar) dimensionality reduction techniques as a pre-processing filtration step to effectively reduce the search space. BFT drastically reduces the search space and the running time of the join operation. Our empirical results on a number of prokaryote and eukaryote DNA contig databases demonstrate a filtration ratio of up to 99.9%, efficiently pruning non-relevant portions of the database while incurring no false negatives, with running times up to 50 times faster than traditional dynamic programming and q-gram extraction approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR, and FastA, for the purpose of pairwise whole-genome comparison. Additionally, we discuss the integration of our proposed techniques for more efficient approximate joins in text databases, data integration, and data cleansing. We analyze the precision of BFT and other transformation-based dimensionality reduction techniques, and finally discuss the trade-offs they impose.
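The filter-then-verify pattern behind an approximate join with no false negatives can be sketched as follows. This is not the paper's BFT, whose bit encoding the abstract does not specify. As a stand-in, the sketch uses a simple character-frequency vector, whose distance never exceeds the edit distance, so pruning on it cannot discard a true match. The function names, the threshold `k`, and the sample sequences are all assumptions made for illustration.

```python
from collections import Counter


def frequency_distance(a, b):
    """Lower bound on edit distance computed from character counts.

    One edit operation changes the surplus on either side by at most 1,
    so frequency_distance(a, b) <= edit_distance(a, b).
    """
    ca, cb = Counter(a), Counter(b)
    alphabet = set(ca) | set(cb)
    surplus_a = sum(max(ca[ch] - cb[ch], 0) for ch in alphabet)
    surplus_b = sum(max(cb[ch] - ca[ch], 0) for ch in alphabet)
    return max(surplus_a, surplus_b)


def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def approximate_join(left, right, k):
    """Return pairs within edit distance k, plus the number of pairs verified."""
    matches, verified = [], 0
    for a in left:
        for b in right:
            if frequency_distance(a, b) > k:
                continue  # safe to prune: the lower bound already exceeds k
            verified += 1
            if edit_distance(a, b) <= k:
                matches.append((a, b))
    return matches, verified


# Hypothetical sample data, not taken from the paper.
left = ["ACGT", "AAAA"]
right = ["ACGA", "TTTT", "AAAT"]
matches, verified = approximate_join(left, right, k=1)
```

On this toy input the cheap filter leaves only two of the six candidate pairs for the dynamic-programming step. A filter of this kind can still pass pairs that fail verification (for example, two strings with identical character counts in a different order), which is why the verification step remains necessary.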
Similar resources
BFT: Bit Filtration Technique for Approximate String Join in Biological Databases
Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to mitigate the curse of dimensionality. This paper proposes a novel approach that maps the problem of pairwise whole-genome comparison onto an approximate join operation in the well-established relational database context. We propose a ...
Using q-grams in a DBMS for Approximate String Processing
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a ...
Approximate String Joins in a Database (Almost) for Free
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not suppo...
Approximate String Joins
String data is ubiquitous and is commonly used to correlate (or join) entities across autonomous, heterogeneous databases. The main challenge is to effectively deal with the noisy nature of string data, due to, for example, transcription errors, incomplete information, and multiple conventions for recording string valued attributes. Commercial databases do not support approximate string joins d...
Development of Deduced Protein Database Using Variable Bit Binary Encoding
A large amount of biological data is semi-structured and stored in one of the following file formats: flat, XML, or relational files. These databases must be integrated with the structured data available in relational or object-oriented databases. Sequence matching is difficult in such file formats because string comparison incurs high computational cost and time. To reduce the...
Publication date: 2003